Session 1: Bash script / Command line

Aims

This workshop will introduce you to working with the command line in the Bash shell and using basic UNIX commands to manipulate files and their contents.

We will go through the entire workshop together. The workshop consists of both lectures and hands-on exercises. I have prepared an HTML file that you can use as a guide. This file will serve as a reference during the workshop and for your future analyses.

Important: Please make sure to think critically and understand the commands you are running, why you are running them, and what options are available. Always use the man command or the --help option to explore additional features of a command.

Operating System for Interactive Sessions

For macOS or Linux/UNIX-based systems:

Simply open a terminal window. You can do this by searching for “Terminal” in your system’s search bar. A command prompt window should appear, ready for use.

For Windows systems:

Download and install PuTTY from https://www.putty.org/. Use PuTTY to log in to your Amazon cloud instance.

Alternatively, you can use the pre-installed Windows PowerShell, which works much like the Linux command line. Note that many commands differ between PowerShell and Linux terminals; this cheat sheet can help: https://blog.netwrix.com/powershell-commands-cheat-sheet

The command line

To execute a command, type its name at the prompt and press Enter.

ls
## 0.5
## Calanus.sh
## E6609_SampleSheet.csv
## Pop-A-L-37_1.fastq.gz
## Pop-A-L-37_2.fastq.gz
## Pop-A-L-7_1.fastq.gz
## Pop-A-L-7_2.fastq.gz
## Pop-A-S-20_1.fastq.gz
## Pop-A-S-20_2.fastq.gz
## PopA-S-1_1.fastq.gz
## PopA-S-1_2.fastq.gz
## PopC-L-11_1.fastq.gz
## PopC-L-11_2.fastq.gz
## PopC-L-2_1.fastq.gz
## PopC-L-2_2.fastq.gz
## PopC-S-16_1.fastq.gz
## PopC-S-16_2.fastq.gz
## PopC-S-4_1.fastq.gz
## PopC-S-4_2.fastq.gz
## Rmarkdown_tutorials.Rmd
## Rmarkdown_tutorials.html
## Rplot.pdf
## Rplot01_pca_plot_8_indv_square.png
## Rplot01_pca_plot_square.pdf
## Rplot_.frq.png
## Rplot_PCA.png
## Rplot_l_depth.png
## Rplot_lmiss.png
## Rplot_lqual.png
## Rplot_milkfish_pca_8_indv.png
## Rplot_milkifish_8_indv.pdf
## bcl2fastqshell.sh
## big_data.fastq
## extracted_contigs.fastq
## extracted_contigs.fastq.gz
## fastqc_html
## full_filter_nomaf_milkfish_snps_filtered.vcf.gz
## hello.sh
## html_002.html
## html_00_Data_&_Software_Installation.Rmd
## html_00_Data_Software_Installation.html
## html_02_Bash_scripting.Rmd
## html_02_Bash_scripting.html
## html_03.Rmd
## html_03.html
## html_03.log
## html_03.tex
## html_ppt_04.html
## html_ppt_05.Rmd
## loop.sh
## milkfish_edited.sh
## milkfish_filtered.frq
## milkfish_filtered.het
## milkfish_filtered.idepth
## milkfish_filtered.imiss
## milkfish_filtered.ldepth.mean
## milkfish_filtered.lmiss
## milkfish_filtered.log
## milkfish_filtered.lqual
## milkfish_pca_nomaf.eigenval
## milkfish_pca_nomaf.eigenvec
## milkfish_snps_filtered.vcf.gz
## names_sorted.txt
## newfile.txt
## participants.txt
## pca
## rsconnect
## seqtk
## session1_file1.txt
## session1_file2.txt

ls is short for list: it lists the names of the files and folders contained in the current directory.

Getting help - manual pages

If you type:

man ls

This brings up the manual page for that command; press q to leave it. You can use man with almost every basic shell command, or alternatively add --help (where the command supports it) to print basic information about the command.

Working in the command line

Navigation - where am I?

pwd
## /Users/apollo/Documents/Popgen_Workshop/Day_001

pwd stands for print working directory. It prints the full path of the directory you are currently in. This matters whenever you need the path to your input or output files, or to software that has not been placed in one of the system's directories (we will discuss this later on).
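
The sketch below shows how the output of pwd changes as you move around. It assumes the cd (change directory) command, which this section has not introduced, and works in a scratch folder made with mktemp; the folder names project and data are made up for illustration.

```shell
# Work in a throwaway scratch directory so nothing real is touched
scratch=$(mktemp -d)
cd "$scratch"

mkdir -p project/data   # hypothetical folders, created only for this demo
cd project/data         # move two levels down
inner=$(pwd)            # path now ends in /project/data
echo "$inner"

cd ..                   # ".." means the parent directory
outer=$(pwd)            # path now ends in /project
echo "$outer"
```

Running pwd after every cd is a cheap way to confirm where a command will read and write its files.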

File Manipulation

Making directories

Make a directory:


mkdir test
Fig 1. mkdir test .

Question

A new test folder is created.

Now try to remove the folder.

What code are you going to use?


rm -r test
Fig 2. removing test folder / directory .

Deleting files and directories

To delete EMPTY directories from the system, you can use rmdir (remove directory) command.


rmdir DIRECTORY

To make an empty file you can use touch


touch newfile.txt

To delete a file, the ‘rm’ command can be used


rm newfile.txt
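
A minimal sketch of how rmdir, rm and touch interact, run in a scratch folder created with mktemp (an assumption made so that no real files are touched). It shows that rmdir refuses a directory that still contains a file.

```shell
scratch=$(mktemp -d)
cd "$scratch"

mkdir test
touch test/newfile.txt

# rmdir only removes EMPTY directories, so this first attempt fails
if rmdir test 2>/dev/null; then
    first_try="removed"
else
    first_try="refused"
fi
echo "rmdir on a non-empty directory: $first_try"

rm test/newfile.txt   # delete the file first
rmdir test            # the directory is now empty and can be removed
```

rm -r test would have removed the folder and its contents in one step, which is convenient but unforgiving: there is no recycle bin on the command line.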

Copying files

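To copy a file, use the cp command.

You have to specify the path to the file you want to copy and its destination. For example, to make a copy called newfile_copy.txt (an example name) in the same directory:


cp newfile.txt newfile_copy.txt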

Copying folders

Let's say you want to copy the folder 0.5 to your home directory:


cp -r 0.5/ /Users/apollo

or use ~, which is shorthand for your home directory:

cp -r 0.5/ ~

Copying a file into a folder

Type cp, then the name of the file you want to copy, then the folder you want to copy it into. Be careful with the destination: if a file with the same name already exists there, cp overwrites it without asking.

Let's say I want to copy newfile.txt into the 0.5/ directory.

I simply type:


cp newfile.txt 0.5/

Then type ls 0.5/ to check that the copied file is in the 0.5/ folder.
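
To make the overwrite warning concrete, here is a small sketch in a scratch folder. The names notes.txt, backup and backup_copy are hypothetical and exist only for this demonstration.

```shell
scratch=$(mktemp -d)
cd "$scratch"

mkdir backup
echo "first version" > notes.txt
cp notes.txt backup/            # backup/notes.txt now holds "first version"

echo "second version" > notes.txt
cp notes.txt backup/            # silently replaces the earlier copy
cat backup/notes.txt            # prints: second version

cp -r backup backup_copy        # -r copies a folder and everything inside it
ls backup_copy
```

If you would rather be asked before anything is replaced, cp -i prompts for confirmation when the destination already exists.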

File permissions

Every file on the system has a set of permissions that determine who can read, change or delete, or execute the file.

With typical default settings, files you create in your account are readable and writable by you. They are not executable until execute permission is added.

Other files, not created by you, may have different permissions.

Viewing file permissions

To see the permission settings for a file or files in a folder, we can use the ls command as follows:


ls -l

or


ls -l filename

Now check the permissions of each file in your directory. What do they say?

File permissions

File permissions are split into groups of three, and each position in a group denotes a specific permission, in this order: read (r), write (w), execute (x), written rwx. A dash (-) in a position means that permission is not granted. The very first character of the listing shows the file type (- for a regular file, d for a directory).

The first group of three (characters 2-4) gives the permissions for the file's owner. In -rwxr-xr--, the owner has read (r), write (w) and execute (x) permission.

The second group of three (characters 5-7) gives the permissions for the group to which the file belongs. In -rwxr-xr--, the group has read (r) and execute (x) permission, but no write permission.

The last group (characters 8-10) gives the permissions for everyone else. In -rwxr-xr--, everyone else has read (r) permission only.
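
As a sketch, the permission string can be taken apart by character position with cut -c, using exactly the positions described above. The variable names are made up for illustration.

```shell
perms="-rwxr-xr--"

# Slice the 10-character string by position
file_type=$(printf '%s\n' "$perms" | cut -c1)     # "-" means a regular file
owner=$(printf '%s\n' "$perms" | cut -c2-4)       # characters 2-4
group=$(printf '%s\n' "$perms" | cut -c5-7)       # characters 5-7
others=$(printf '%s\n' "$perms" | cut -c8-10)     # characters 8-10

echo "type:   $file_type"
echo "owner:  $owner"
echo "group:  $group"
echo "others: $others"
```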

Note

Sometimes you will need to change the permissions on files in order to execute them.

The command to change permissions is chmod

You have to specify who you are modifying the permissions of, what the new permissions are, and what file or directory to act on.

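For example, you can quickly change file permissions using numeric codes. The code below gives read, write and execute permission to everyone, which allows a file to be executed; this is important for scripts. Because 777 also lets every user modify the file, prefer a narrower setting such as chmod u+x filename or chmod 755 filename on shared systems.


chmod 777 filename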
Fig 3. File permission codes .
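
The sketch below compares a few chmod settings on a throwaway file (demo.sh is a made-up name) and reads the result back with ls -l. Each digit of a numeric code is the sum of read (4), write (2) and execute (1), so 7 means rwx, 5 means r-x and 4 means r--.

```shell
scratch=$(mktemp -d)
cd "$scratch"

printf '#!/bin/bash\necho "it runs"\n' > demo.sh

chmod 644 demo.sh                          # rw-r--r--: not executable yet
before=$(ls -l demo.sh | cut -c1-10)

chmod u+x demo.sh                          # add execute for the owner only
owner_exec=$(ls -l demo.sh | cut -c1-10)

chmod 755 demo.sh                          # rwxr-xr-x: everyone may run, only owner may edit
numeric=$(ls -l demo.sh | cut -c1-10)

echo "644: $before"
echo "u+x: $owner_exec"
echo "755: $numeric"
```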

Viewing files

There are many ways to view a text file. One way for simple viewing is to type:


less filename

Use less to look at big files in your directory. Try viewing big_data.fastq:


less big_data.fastq

The less command displays as much of the file as can fit onto the screen. To scroll up and down within the document, use the arrow keys. Hitting the space bar will bring a new screen-full of information.

Fig 4. File permission codes .

To search forward in the file for a given pattern, type / followed by the pattern and press Enter; press n to jump to the next match. For example, to look for sequences that contain AAACCCGGGTTT, type /AAACCCGGGTTT.

Remember that you need to be inside less for this to work.

To exit the less program and return to the prompt, press:


q

Viewing part of the file

tail

The tail command displays the last few lines of a file. By default tail will show the last ten lines of a file, but you can tell it how many lines to display.


tail -n 10 big_data.fastq
## +
## II#IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII9IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII9III*IIIIIIII9IIIIIIIIIIIIIIII
## @LH00684:49:22JCVJLT4:7:2498:49500:29767 1:N:0:CATGAGCA
## AANGGCAAGCTGTAAGTAGCAGGTATCACAACTGGTGGGGTTCATGGTTGTCTTCAATTAACCGAACATAGAGGGATGAGCTGGATCACGGCTGGCAGGGATCACAGTTCTGCTCAGCAGGAGATCGGAAGAGCACACGTCTGAACTCCAG
## +
## II#IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII9IIIIIIIIIIIIIIIIIIIIIIIII
## @LH00684:49:22JCVJLT4:7:2498:51813:29767 1:N:0:CATGAGCA
## TGNTGTCTGACTAGGTCAGCAAGGGTCAGAGGAATCGGAGCGTCAGGACACTGGCCTTTCTTTCCTAATCCTGATAATTGCTTAGAGAGAGCCGATGTCTCACAACCTCCCGAAATGTCTTTCATTGTCTCCTCTCTAACAAGACCTGGGG
## +
## II#IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII9I9IIIIIIIIIIIII*99IIII99IIII9II9IIIIII9IIII9IIII*IIIII9I9I9I9*9IIII*IIII9II99IIIII9I9I*99III9IIII9I9I*

This shows the last ten lines of that file.

You can also use the -f option, which keeps the file open and prints new lines as they are added. This is very useful for tracking the progress of software that writes to a log file – e.g. MrBayes. Press Ctrl+C to stop following.


tail -f -n  10 big_data.fastq
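
A small sketch of the -n option on a file whose contents are easy to predict. It also uses head, the counterpart of tail that prints the first lines of a file; head is not covered in this section and is included here as an extra. The file numbers.txt is made up for the demo.

```shell
scratch=$(mktemp -d)
cd "$scratch"

seq 1 20 > numbers.txt          # a 20-line file holding the numbers 1 to 20

tail -n 3 numbers.txt           # last three lines
head -n 2 numbers.txt           # first two lines

# Join the lines with commas so the results fit on one line
last_three=$(tail -n 3 numbers.txt | paste -s -d, -)
first_two=$(head -n 2 numbers.txt | paste -s -d, -)
echo "last three: $last_three"
echo "first two:  $first_two"
```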

paste

paste is a lovely command that does all sorts of interesting things. With the -s option, paste can transpose data from a single column into a single row (delimited by whatever you like: tabs, commas, spaces etc).

The command below will transpose the contents of session1_file1.txt into a single line.

First, let’s view the file using less


less session1_file1.txt
## Mia 
## Felix 
## KC 
## Inggat 
## TJ 
## Errol 
## Belay 
## Tabards 
## Gela 
## Anj 
## Raf 
## Tin 
## Aye 
## Janelle 
## Apollo 
## LeQin 
## Syd 
## Reina 
## Rachel

It should be a single column


paste -s session1_file1.txt
## Mia  Felix   KC  Inggat  TJ  Errol   Belay   Tabards     Gela    Anj     Raf     Tin     Aye     Janelle     Apollo  LeQin   Syd     Reina   Rachel

Try specifying a delimiter, in this case “,”:


paste -s -d"," session1_file1.txt
## Mia ,Felix ,KC ,Inggat ,TJ ,Errol ,Belay ,Tabards ,Gela ,Anj ,Raf ,Tin ,Aye ,Janelle ,Apollo ,LeQin ,Syd ,Reina ,Rachel

paste can also intersperse all the rows of two files into a single combined file.

Have a look at session1_file1.txt and session1_file2.txt using less or cat


paste -d"\n" session1_file1.txt session1_file2.txt > participants.txt

Now view the file participants.txt

What did you see?

Challenge 1: Can you paste the names and surnames together side-by-side?

paste is really useful for all sorts of file-formatting tasks, e.g. turning a list of SNP identifiers (headers) with one identifier per genotype into one identifier per allele.

How would you do that?
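
One possible approach to both questions, sketched on tiny made-up files (first_names.txt, surnames.txt and snp_ids.txt are hypothetical, as are their contents). Without -s, paste joins the files line by line; giving the same file twice with a newline delimiter repeats every line.

```shell
scratch=$(mktemp -d)
cd "$scratch"

printf 'Ana\nBen\n' > first_names.txt
printf 'Cruz\nDiaz\n' > surnames.txt

# Side by side: line 1 with line 1, line 2 with line 2 (space as delimiter)
paste -d' ' first_names.txt surnames.txt
side_by_side=$(paste -d' ' first_names.txt surnames.txt | paste -s -d, -)

# One identifier per genotype -> one per allele: interleave the file with itself
printf 'snp1\nsnp2\n' > snp_ids.txt
paste -d'\n' snp_ids.txt snp_ids.txt
per_allele=$(paste -d'\n' snp_ids.txt snp_ids.txt | paste -s -d, -)
```

Leaving out -d gives a tab between the columns, which is usually what downstream tools expect.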

cat and redirect


cat session1_file1.txt > newfile.txt

This creates a new file (newfile.txt) with the same contents as the old file (session1_file1.txt). If newfile.txt already exists, > overwrites it.

Now try:


cat session1_file1.txt >> session1_file2.txt

This appends the contents of file1 to file2. It is equivalent to opening file1, copying all the contents, pasting them at the end of file2 and saving it.


cat session1_file1.txt session1_file1.txt session1_file1.txt > newfile2.txt

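This concatenates every file listed, in order, into newfile2.txt. Here the same file is listed three times, so newfile2.txt contains three copies of it; list three different files to copy the contents of files 1-3 into file 4.

Note: Suppose you have lots of files and each of them contains a single FASTA sequence.

You can combine them all into a single file, sequences.fasta, using a redirect.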


cat *.fas >> sequences.fasta

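This will combine all .fas files into a single file. Because >> appends, running the command twice adds every sequence twice; use > if you want to start from an empty file.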
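
The difference between > and >> is easiest to see on files with one line each. This is a minimal sketch; a.txt, b.txt and out.txt are made-up names.

```shell
scratch=$(mktemp -d)
cd "$scratch"

echo "one" > a.txt
echo "two" > b.txt

cat a.txt > out.txt     # out.txt contains: one
cat b.txt > out.txt     # ">" starts over, so out.txt now contains only: two
cat a.txt >> out.txt    # ">>" appends, so out.txt contains: two, then one

result=$(paste -s -d, out.txt)
echo "$result"
```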

split

This command splits a file into a series of smaller files.

The content of the input file is split into ordered files named with the prefix “x”, unless another prefix is provided as argument to the command.

The switch -l lets you specify how many lines to include in each file.

Split the content into separate files of 3 lines each and output to new files prefixed with the name small


split -l 3 session1_file1.txt small

What happened? Can you tell us?
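
For comparison, here is a sketch of the same split command on a 7-line file (seven_lines.txt is a made-up name). With 3 lines per piece, split writes three files whose names are the prefix followed by aa, ab and ac; the last piece holds the single leftover line.

```shell
scratch=$(mktemp -d)
cd "$scratch"

seq 1 7 > seven_lines.txt
split -l 3 seven_lines.txt small

ls small*                                  # smallaa smallab smallac
pieces=$(ls small* | wc -l | tr -d ' ')    # number of pieces
last_piece=$(cat smallac)                  # the leftover line

# cat puts the pieces back together in alphabetical order
cat small* > rejoined.txt
```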

Grep - Global Regular Expression Print

This command prints the lines of the input file that match the given pattern(s).

Very useful for file manipulation.

Excellent for searching for a particular pattern in a file and outputting the results to screen or file

Grep is very cool, probably my favorite command :)

print lines that match “Apollo”


grep Apollo session1_file1.txt
## Apollo

“-v” performs a “reverse-matching” and prints only the lines that do not match the pattern.


grep -v Apollo session1_file1.txt
## Mia 
## Felix 
## KC 
## Inggat 
## TJ 
## Errol 
## Belay 
## Tabards 
## Gela 
## Anj 
## Raf 
## Tin 
## Aye 
## Janelle 
## LeQin 
## Syd 
## Reina 
## Rachel

“-i” specifies a case-insensitive match (by default this command is case sensitive).


grep -i apollo session1_file1.txt
## Apollo

More cool grep

Everyone with an “a” in their name (the match is case sensitive, so Anj and Apollo are not listed):


grep 'a' session1_file1.txt
## Mia 
## Inggat 
## Belay 
## Tabards 
## Gela 
## Raf 
## Janelle 
## Reina 
## Rachel
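
A few more grep patterns, sketched on a short made-up list (names.txt here is hypothetical and is not one of the workshop files). The ^ and $ symbols anchor a pattern to the start and end of a line, -c counts matching lines, and -n prints line numbers.

```shell
scratch=$(mktemp -d)
cd "$scratch"

printf 'Mia\nApollo\nAnj\nRachel\nReina\n' > names.txt

starts_with_A=$(grep -c '^A' names.txt)    # lines beginning with A
ends_with_a=$(grep -c 'a$' names.txt)      # lines ending with a
any_a=$(grep -i -c 'a' names.txt)          # a or A anywhere in the line
numbered=$(grep -n 'el' names.txt)         # line number, then the matching line

echo "start with A: $starts_with_A"
echo "end with a:   $ends_with_a"
echo "contain a/A:  $any_a"
echo "$numbered"
```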

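Count sequences in a file

In a FASTA file every header line starts with >, so grep -c "^>" counts the sequences. big_data.fastq is a FASTQ file, whose header lines start with @ instead. Quality lines can also start with @, so match a longer prefix of the header, here the instrument name seen in the tail output above:


grep -c "^@LH00684" big_data.fastq

Breathe! This may take a while :)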

Other ways of manipulating text

awk is very useful

Extract a column from a data file


awk '{ print $2 }' big_data.fastq > column2_big_data.fastq

Extract selected columns from a data file


awk '{ print $7, $8, $9 }' big_data.fastq

Or use cut


cut -f3-5 big_data.fastq

Now compare the results of the awk and cut commands. What did you observe? (Hint: awk splits fields on any whitespace by default, while cut splits on tabs only.)
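
The difference shows up clearly on a small tab-delimited table in which one column contains a space. This is a sketch on made-up data; snps.tsv and its values are hypothetical.

```shell
scratch=$(mktemp -d)
cd "$scratch"

# Three tab-separated columns; the third column contains a space
printf 'chr1\t100\tA G\nchr2\t200\tC T\n' > snps.tsv

# awk splits on any whitespace by default, so "A G" is broken in two
awk_default=$(awk '{ print $3 }' snps.tsv | paste -s -d, -)

# cut splits on tabs only, so the third column stays whole
cut_tabs=$(cut -f3 snps.tsv | paste -s -d, -)

# awk behaves like cut once the field separator is set to a tab
awk_tabs=$(awk -F'\t' '{ print $3 }' snps.tsv | paste -s -d, -)

echo "awk default: $awk_default"
echo "cut:         $cut_tabs"
echo "awk -F tab:  $awk_tabs"
```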

pipe

In bash you can pipe the output from one command to another using the | symbol.

For example


ls -l | grep '\.txt$'
## -rw-r--r--@  1 apollo  staff          120 Apr  2 15:54 names_sorted.txt
## -rw-r--r--@  1 apollo  staff          120 Mar 15 11:38 newfile.txt
## -rw-r--r--@  1 apollo  staff          301 Apr  2 15:55 participants.txt
## -rw-r--r--   1 apollo  staff          120 Mar 15 11:24 session1_file1.txt
## -rw-r--r--   1 apollo  staff          181 Mar 15 11:31 session1_file2.txt

The output of ls -l is sent to grep, which in turn prints only the lines that match the regular expression \.txt$ (lines ending in .txt).

To find .gz files use


ls -l | grep '\.gz$'
## -rwxr-xr-x   1 apollo  staff   4042378619 Feb 16 11:34 Pop-A-L-37_1.fastq.gz
## -rwxr-xr-x   1 apollo  staff   4295840551 Feb 16 11:34 Pop-A-L-37_2.fastq.gz
## -rwxr-xr-x   1 apollo  staff   3281988085 Feb 16 11:35 Pop-A-L-7_1.fastq.gz
## -rwxr-xr-x   1 apollo  staff   3470062588 Feb 16 11:35 Pop-A-L-7_2.fastq.gz
## -rwxr-xr-x   1 apollo  staff   3850705868 Feb 16 11:25 Pop-A-S-20_1.fastq.gz
## -rwxr-xr-x   1 apollo  staff   4093409300 Feb 16 11:25 Pop-A-S-20_2.fastq.gz
## -rwxr-xr-x   1 apollo  staff   5336438506 Feb 16 11:24 PopA-S-1_1.fastq.gz
## -rwxr-xr-x   1 apollo  staff   5539019476 Feb 16 11:24 PopA-S-1_2.fastq.gz
## -rwxr-xr-x   1 apollo  staff   1892665417 Feb 16 11:36 PopC-L-11_1.fastq.gz
## -rwxr-xr-x   1 apollo  staff   2049544307 Feb 16 11:36 PopC-L-11_2.fastq.gz
## -rwxr-xr-x   1 apollo  staff   1451179507 Feb 16 11:37 PopC-L-2_1.fastq.gz
## -rwxr-xr-x   1 apollo  staff   1520890252 Feb 16 11:37 PopC-L-2_2.fastq.gz
## -rwxr-xr-x   1 apollo  staff   3275253918 Feb 16 11:35 PopC-S-16_1.fastq.gz
## -rwxr-xr-x   1 apollo  staff   3394558311 Feb 16 11:35 PopC-S-16_2.fastq.gz
## -rwxr-xr-x   1 apollo  staff   2082049955 Feb 16 11:35 PopC-S-4_1.fastq.gz
## -rwxr-xr-x   1 apollo  staff   2197770802 Feb 16 11:35 PopC-S-4_2.fastq.gz
## -rw-r--r--@  1 apollo  staff            0 Mar 16 11:27 extracted_contigs.fastq.gz
## -rw-r--r--   1 apollo  staff      3733738 Mar 30 20:21 full_filter_nomaf_milkfish_snps_filtered.vcf.gz
## -rw-r--r--   1 apollo  staff    424144608 Mar 25 11:51 milkfish_snps_filtered.vcf.gz

This will list all the .gz files in your directory along with their permissions.

Combining commands into pipeline

UNIX lets you combine virtually any commands with the pipe symbol |.

The output of the first command is used as input for the next command.

Combining grep and wc will give you the number of lines containing a particular pattern, for example the letter a:


grep a session1_file1.txt | wc -l

(You can also count with grep's -c option, but here we are illustrating how to combine commands.)

On a large file this may take a while…patience is a virtue :)
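
Pipelines can have more than two stages. The sketch below counts files by extension in a scratch folder filled with empty, made-up files; tr -d ' ' is added only to strip the padding that wc prints on some systems.

```shell
scratch=$(mktemp -d)
cd "$scratch"

touch a.txt b.txt c.txt reads_1.fastq.gz reads_2.fastq.gz

# Three stages: list the names, keep those ending in .txt, count the lines
txt_count=$(ls | grep '\.txt$' | wc -l | tr -d ' ')

# The same idea in two stages, using grep's own counter
gz_count=$(ls | grep -c '\.gz$')

echo "$txt_count text files, $gz_count gzip files"
```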

sort

Sorting the rows of a tabular file can be useful for digesting large data outputs. It is also very useful for sorting lists of p-values, FST values etc.

By default, sort orders the lines of a file alphabetically, starting from the first column.


sort session1_file1.txt > names_sorted.txt

Check the names_sorted.txt file

What happened?

If you want to sort a file based on a different column, use the -k option


sort -k 2 session1_file1.txt > names_sorted_2.txt

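Look at names_sorted_2.txt. Does it differ from names_sorted.txt? (session1_file1.txt has only one column, so the key for column 2 is empty on every line and sort falls back to comparing whole lines.)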

sort options

-r reverse sort
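
A sketch of -k and -r on a two-column table, together with -n (numeric sort), an option not listed above. The file counts.txt and its values are made up; LC_ALL=C is set only so the alphabetical order is the same on every system.

```shell
scratch=$(mktemp -d)
cd "$scratch"

printf 'PopA 9\nPopB 100\nPopC 10\n' > counts.txt

# Column 2 compared as text: "10" and "100" sort before "9"
as_text=$(LC_ALL=C sort -k 2,2 counts.txt | paste -s -d, -)

# Column 2 compared as numbers
as_numbers=$(LC_ALL=C sort -k 2,2 -n counts.txt | paste -s -d, -)

# Numeric and reversed: largest value first
largest_first=$(LC_ALL=C sort -k 2,2 -n -r counts.txt | paste -s -d, -)

echo "text:          $as_text"
echo "numeric:       $as_numbers"
echo "largest first: $largest_first"
```

Writing the key as 2,2 limits the comparison to column 2 alone; plain -k 2 would compare from column 2 to the end of the line.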

Unpacking, zipping, tarballs etc

tar and gzip

tar vs gzip: tar assembles files together, gzip compresses them.

The tar command is used to create .tar.gz or .tgz archive files, also called “tarballs.”

This command has a large number of options, but you just need to remember a few letters to quickly create archives with tar.

The tar command can extract the resulting archives, too.

tar - Compress a Single File or Entire Directory

tar -czvf name-of-archive.tar.gz /path/to/directory-or-file

-c: Create an archive.

-z: Compress the archive with gzip.

-v: Display progress in the terminal while creating the archive, also known as “verbose” mode.

-f: Allows you to specify the filename of the archive.


tar -czvf test.tar.gz  *.fas

tar - Extract an Archive

Once you have an archive, you can also extract it with the tar command.

The following command will extract the contents of archive.tar.gz to the current directory.


tar -xzvf archive.tar.gz

The -x switch replaces the -c switch. This specifies you want to extract an archive instead of create one.
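
A round trip in a scratch folder: create an archive, look inside it, and extract it somewhere else. The sketch uses two switches not described above, -t (list the contents without extracting) and -C (choose the folder to extract into); the folder and file names are made up.

```shell
scratch=$(mktemp -d)
cd "$scratch"

mkdir seqs
echo ">seq1" > seqs/a.fas
echo ">seq2" > seqs/b.fas

tar -czf seqs.tar.gz seqs            # create a gzip-compressed archive (no -v, so it is quiet)
listing=$(tar -tzf seqs.tar.gz)      # -t lists what is inside
echo "$listing"

mkdir restored
tar -xzf seqs.tar.gz -C restored     # extract into restored/ instead of the current folder
ls restored/seqs
```

Listing an unfamiliar archive with -t before extracting it shows whether it will unpack into its own folder or scatter files into the current one.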

Any questions so far?

Shell scripts

How to create a basic shell script?

A shell script is a text file with the following format:

Fig 5. Shell script.

All bash scripts start with the line #!/bin/bash and then either further comment lines that start with # or straight into commands.

You can create the file with nano (or create an empty one with touch) and save it with a .sh extension.

You will then have to make it executable with chmod (for example chmod 777, as shown below). Then…

Try this simple shell script

First type nano, then press Enter.

Then type #!/bin/bash on the first line.

On the next line, type the command echo Hello World!

Press Ctrl + X, then Y to save, name the file hello.sh, then press Enter.

Fig 6. Shell script.

Now, change the permission of the file by typing:


chmod 777 hello.sh

On many terminals the file name then turns green in the ls output, indicating that it is now an executable file.

Now run the shell script that you just made


sh hello.sh
## Hello World!

Did you get the same results?

Congratulations!!! Now you have written your first shell script!!

Creating a longer script

You can try creating longer scripts that do more complex tasks or loop through processes. Try this basic for loop.

Looping is very important, especially if you have many files and would like to automate a task. Writing loops in Linux is much simpler than you might think.

Try this!

Fig 7. Shell script.

What happened? What did it do?
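
For reference, a minimal for loop looks like the sketch below. It is not necessarily the script shown in the figure; the names in the list and the variable names are chosen only for illustration.

```shell
count=0

# The loop body runs once for each word after "in"
for name in Mia Felix KC; do
    echo "Hello, $name!"
    count=$((count + 1))     # keep a running total of iterations
done

echo "Greeted $count people"
```

The list after in can also be a pattern such as *.txt, in which case the loop runs once per matching file.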

Another simple for script

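This time let's practice on a real .fastq file.

Let's extract all contigs that contain the sequence AAAGGGCCCTTT from the big_data.fastq file. The output is plain text, so it is saved with a .fastq extension rather than .fastq.gz.


# Loop over the input file(s); add more .fastq names after "in" to process several
for file in big_data.fastq
  do
    # -B 1 keeps the header line before each match, -A 2 keeps the "+" and quality lines after it
    grep -B 1 -A 2 "AAAGGGCCCTTT" "$file"

done > extracted_contigs.fastq


Then count the number of extracted contigs. Quality lines can also start with @, so match the instrument name at the start of each header:


grep -c "^@LH00684" extracted_contigs.fastq

This may take a while!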
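
One detail worth knowing: when grep prints context lines (-A or -B), it writes a line containing only -- between groups of matches that are not adjacent, and those lines are not valid FASTQ. The sketch below reproduces this on a tiny made-up file (tiny.fastq and its reads are hypothetical) and filters the separators out with grep -v. It assumes no quality line consists of exactly --, which holds for reads longer than two bases.

```shell
scratch=$(mktemp -d)
cd "$scratch"

# Three made-up reads; read1 and read3 contain the motif, read2 does not
cat > tiny.fastq <<'EOF'
@read1
AAAGGGCCCTTTAC
+
IIIIIIIIIIIIII
@read2
TTTTTTTTTTTTTT
+
IIIIIIIIIIIIII
@read3
GGAAAGGGCCCTTT
+
IIIIIIIIIIIIII
EOF

grep -B 1 -A 2 "AAAGGGCCCTTT" tiny.fastq > with_separators.fastq
grep -v '^--$' with_separators.fastq > extracted.fastq    # drop the separator lines

raw_lines=$(wc -l < with_separators.fastq | tr -d ' ')    # 8 record lines + 1 separator
clean_lines=$(wc -l < extracted.fastq | tr -d ' ')        # 2 records x 4 lines
reads=$(grep -c '^@read' extracted.fastq)
echo "$reads reads, $clean_lines lines (was $raw_lines with separators)"
```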

Try to solve these fun problems

Problem 1: Counting Words in Multiple Files

  1. Write a shell script that counts the number of words in all .txt files in your current directory and displays the file name along with the word count.

Example Output:

file1.txt: 150 words

file2.txt: 98 words

file3.txt: 200 words

Hints:

Use a for loop to iterate over .txt files.

Use wc -w to count words.

Use echo to print the results.

Answer?


#!/bin/bash

for file in *.txt; do
    word_count=$(wc -w < "$file")
    echo "$file: $word_count words"
done

Who got this answer? or a different answer?

TIP: Use AI to improve your commands or to get started on them. Then double-check! Don't be afraid to use AI; use it to your advantage.

Now I want you to solve this problem set using any AI tool.

Problem 2: File Renaming with Loop

  1. Rename all the .fastq.gz files to processed_$file, where $file is the original file name (name_of_the_file.fastq.gz).

Answer?


#!/bin/bash

for file in *.fastq.gz; do
    mv "$file" "processed_$file"
done
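
Renaming is hard to undo, so it can help to do a dry run first. The sketch below is one way to do that, on empty made-up files in a scratch folder: putting echo in front of mv prints each command instead of running it. The case line, an addition not in the answer above, skips files that already carry the prefix so that running the script twice does not produce processed_processed_ names.

```shell
scratch=$(mktemp -d)
cd "$scratch"
touch sampleA.fastq.gz sampleB.fastq.gz      # hypothetical input files

# Dry run: print what would be done, change nothing
for file in *.fastq.gz; do
    case "$file" in processed_*) continue ;; esac
    echo mv "$file" "processed_$file"
done

# Real run: same loop without the echo
for file in *.fastq.gz; do
    case "$file" in processed_*) continue ;; esac
    mv "$file" "processed_$file"
done

renamed=$(ls | paste -s -d, -)
echo "$renamed"
```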

Remember:

Bash scripts are very useful, and there are numerous uses for them that allow you to run multiple commands, loop through variables etc

Practice! Practice! Practice!

No single solution to a problem

PERL & PYTHON

Why use Perl?

Interpreted language – quick to program

Easy to learn compared to most languages

Designed for working with text files

Free for all operating systems

Long popular in bioinformatics – many scripts available that you can “borrow”, and also ready-made modules.

Why use Python?

Many users now use Python, which is a more general-purpose language than Perl.

Better syntax

Used beyond bioinformatics

Fewer ready made scripts available online.

Many libraries available for bioinformatics, e.g. Biopython.

This is just an intro - lots of resources out there!

Fig 8. Book resources.